Shopify Summer 2021 Data Science Challenge

By: Jad El-Asmar

Part 1

In this notebook, I will explore a dataset of sneaker orders from 100 sneaker shops. I will try to find out the reason behind a high average order value (AOV) of $3145.13 over a 30 day window. Then, I will determine a more accurate metric that would represent more this dataset.

I will be answering the following questions: Think about a better way to evaluate this data. What metric would you report for this dataset? What is its value?

a) Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.

Local data path

Read dataset

Data profiling for the current dataframe

By looking at the profile report and analyzing each variable, I noticed something strange in the order_amount variable. Indeed, the maximal value observed is 704 000 and the minimal value is 90. There is a major difference in both of these extremums. Also, the variable total_items had a major difference in its extremums with a minimum of 1 and a maximum of 2000 items.

By looking at the warnings generated by the profile report, I noticed that there was a high correlation between the variables I mentionned. Indeed, by looking at the correaltion matrices like Pearson's, we can clearly see a strong correlation between the order_amount and the total_items. This is expected since the order_amount is obtained by a product of the total_items with the price of the shoe that we do not have in this dataset.

Therefore, I decided to explore in more details the order_amount variable. It will probably lead to the same conclusions as analyzing the total_items variable because of their strong correlation.

View missing values

Before exploring a dataset, it is always a good idea to look if there are any missing values. Even though, I could clearly see in the profile report that there are not missing values. I wanted to showcase a library I like using to visualize my data to observe how many missing values are present.

By looking at the previous diagram, we can see that there are no missing values in the dataset.

Here is an example of a dataset I worked with that had a lot of missing values.

image.png

The white lines represent the missing data points. We can clearly see a difference with the current shopify orders dataset that had a diagram with no missing data points

Visual representations

In this section, I decided to visualize the order_amount values in more details.

Scatter plot

As we can see, most of the orders have an order_amount less than 100 000. We can clearly see one data point around 150 000 and a couple of data points at arounf 700 000.

Plotly

Interactive data representation using Dash

Do not run this code, it was only to visualize the data in an interactive way, this can be viewed in the following video. If you wish to try it, just run this code but keep in mind that the notebook will be in a blocked state and you would have to stop it to run other cells

Let's see how many orders have an order amount equal to 7040000.

There are 17 orders with this outlier amount. We can also see that all these orders were made by using a credit card. Also, these were done by the same user with the id 607, the total amount of items ordered is 2000 and the order was always made at the same time 4:00:00 for different days in the month of march in the year 2017. The amount of items is way too big to be a normal order. Usually a regular user won't buy more than five pair of shoes.

Since 17 orders were made in the same month with an amount that big, this would explain why the average order value is $3145.13. Indeed, by looking at the profile report generated, I could see this value as the mean of the column order_amount.

Using Z Score

The Z score is used to represent the relationship between the data point (observation) and the mean as well as the standard deviation of the dataset. The purpose is to find the distribution of data with a mean of 0 and a standard deviation of 1. This represents a normal distribution. Indeed, the Z-score consists of re-scaling and centralize the data to detect data points that have values far from zero.

The Z score can be represented by the following formula:

Z score = (Observation — Mean)/Standard Deviation

Interquartile Range (IQR)

b) What metric would you report for this dataset?

We can see that the mean is affected by these outliers.

Mean before removing the outliers:

Mean after removing the outliers:

Clearly, the mean value is not an appropriate metric since it is highly impacted by the 17 values with \$ 70 4000 as order amount. The mean without the outliers is \\$ 754.09 which represents less than 25 percent of the value before removing the outiers \$ 3145.13 .

Mode before removing the outliers:

Mode after removing the outliers:

Median before removing the outliers:

Median after removing the outliers:

Mean, median and mode are measures of central tendency. But, the mean is the only value that is affected by the values of outliers. As we saw, the mean was more than four times bigger because of the outliers taken into consideration, but it was not the case for the mode neither the median.

c) What is its value

We can clearly see that the median and the mode are not affected by the outliers. Indeed, I calculated both of these metrics before and after removing the outliers and their value is identical. It is not the case for the mean. For the mode, it's value is 153 and for the median, it is 284.

Part 2

a. How many orders were shipped by Speedy Express in total?

First, here the amount of orders interests us. Therefore, I am going to use COUNT() a function that returns the number of rows that matches a specified criterion. This function will be applied on the OrderID since it is the unique value and is considered as the primary key.

To access the Shipper's name, I have to access another table, the Shippers Table. I need to find the common information that would allow me to link the Orders table and the Shippers'. By looking at both tables, I can see that the common information is the ShipperID.

By calling, the INNER JOIN keyword, it selects records that have matching values in both tables.

Finally, I specify the condition using the WHERE clause to filter the records. In this case, I specified the ShipperName as "Speedy Express". Note that I am using aliases to specify the table the column is associated to to avoid any ambiguity.

SELECT COUNT(o.OrderID) FROM Orders as o

INNER JOIN Shippers as s ON o.ShipperID = s.ShipperID

WHERE s.ShipperName = "Speedy Express"

The result of this SQL query indicates that 54 orders were shipped by Speedy Express.

image.png

b. What is the last name of the employee with the most orders?

In this case, I would like the last name of the employee which is a column present in the Employees table. But, I also need the quantity of orders per employee. Therefore, I also need information from the Orders table. To link both of these tables I call the INNER JOIN keywords on the Employee's ID. To merge all the orders with the same Employee ID together, I called the GROUP BY statement. It groups rows that have the same values into summary rows. In my case, It would group all the orders with the same ID together.

Finally, to have the highest amount of orders, I call the MAX() function on the count of orders total_count and retrieve the LastName associated to this value.

SELECT LastName, MAX(total_count) FROM (SELECT LastName,

COUNT(*) AS total_count FROM Employees AS e INNER JOIN Orders

AS o on e.EmployeeID = o.EmployeeID GROUP BY o.EmployeeID)

The result of this SQL query indicates that the last name of the employee with the most orders is Peacock with 40 orders.

image.png

c. What product was ordered the most by customers in Germany?

In this case, I want the ProductName which is present in the Products table. I also need to know the amount of times this product was ordered. This can be interpreted in two ways.

1.If by "ordered the most", we only take the amount of orders with that product, than the query would be as follow:

SELECT ProductName, MAX(total_orders) FROM

(SELECT ProductName, COUNT(*) AS total_orders,

Country FROM Customers AS c

INNER JOIN Orders as o on c.CustomerID = o.CustomerID

INNER JOIN OrderDetails as d on o.OrderID = d.OrderID

INNER JOIN Products as p on p.ProductID = d.ProductID

WHERE c.Country = "Germany"

GROUP BY p.ProductID)

The result of this SQL query indicates that the product that was ordered the most by customers in Germany is Gorgonzola Telino. It was ordered 5 times.

image.png

  1. If we are taking the quantity of the product ordered in each order, than we have to take in consideration the quantity. Therefore, I would obtain the following query:

SELECT ProductName, MAX(total_quantity) FROM (SELECT

ProductName, SUM(Quantity) AS total_quantity,

Country FROM Customers AS c

INNER JOIN Orders as o on c.CustomerID = o.CustomerID

INNER JOIN OrderDetails as d on o.OrderID = d.OrderID

INNER JOIN Products as p on p.ProductID = d.ProductID

WHERE c.Country = "Germany"

GROUP BY p.ProductID)

In this case, the product ordered the most is the Boston Crab Meat with a total of 160 times.

image.png

Both of these queries are very similar. For the first query, I went for amount of orders with a product. In this case, I used the function COUNT() to get the amount of orders. To link the Customers table with the Orders table, I used the keyword INNER JOIN on the CustomerID. After, to obtain the information about the product, I needed to link the Orders Table to another one. It will be the OrderDetails Table. Once again, I called the INNER JOIN keyword now on the ProductID. To only keep the rows with the counrty Germany, I used the WHERE clause. I ended the query by calling the GROUP BY statement to merge all the rows with the same ProductID.

The second query is the same but instead of going for the amount of orders, I requested the SUM() of the Quantity column.